state-action value function
Risk-Averse Total-Reward Reinforcement Learning
Existing model-based algorithms for risk measures like the entropic risk measure (ERM) and entropic value-at-risk (EVaR) are effective in small problems, but require full access to transition probabilities. We propose a Q-learning algorithm to compute the optimal stationary policy for total-reward ERM and EVaR objectives with strong convergence and performance guarantees. The algorithm and its optimality are made possible by ERM's dynamic consistency and elicitability. Our numerical results on tabular domains demonstrate quick and reliable convergence of the proposed Q-learning algorithm to the optimal risk-averse value function.
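For readers unfamiliar with the entropic risk measure named in the abstract above, the following is a minimal sketch of how ERM can be estimated from sampled returns. It assumes the common reward-based convention ERM_beta[X] = -(1/beta) log E[exp(-beta X)] with beta > 0; the function name `entropic_risk` and the sample values are illustrative and are not taken from the paper, and this is not the paper's Q-learning algorithm.

```python
import math


def entropic_risk(returns, beta):
    """Estimate ERM_beta[X] = -(1/beta) * log E[exp(-beta * X)].

    `returns` is a list of equally weighted sampled returns (assumption),
    and `beta` > 0 is the risk-aversion level.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Use the log-sum-exp trick so large |beta * x| does not overflow.
    scaled = [-beta * x for x in returns]
    m = max(scaled)
    log_mean = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return -log_mean / beta


samples = [0.0, 10.0]  # hypothetical returns
mean = sum(samples) / len(samples)
risk_averse_value = entropic_risk(samples, beta=1.0)
nearly_neutral_value = entropic_risk(samples, beta=1e-6)
```

Under this convention the ERM of a risky return never exceeds its mean, and it approaches the mean as beta tends to zero, which is why a larger beta corresponds to stronger risk aversion.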
ResQ: A Residual Q Function-based Approach for Multi-Agent Reinforcement Learning Value Factorization
The factorization of state-action value functions for Multi-Agent Reinforcement Learning (MARL) is important. Existing studies are limited by their representation capability, sample efficiency, and approximation error. To address these challenges, we propose ResQ, a MARL value function factorization method that can find the optimal joint policy for any state-action value function through residual functions. ResQ masks some state-action value pairs from a joint state-action value function, which is transformed into the sum of a main function and a residual function. ResQ can be used with mean-value and stochastic-value RL. We theoretically show that ResQ can satisfy both the individual global max (IGM) and the distributional IGM principle without representation limitations. Through experiments on matrix games, the predator-prey, and StarCraft benchmarks, we show that ResQ can obtain better results than multiple expected/stochastic value factorization methods.
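The IGM principle mentioned in the abstract above requires that the joint action formed from each agent's individually greedy action also maximizes the joint state-action value. The sketch below only checks that condition on a two-agent matrix game; it is not the ResQ method, and the helper name `satisfies_igm` and the payoff tables are hypothetical values chosen for illustration.

```python
def satisfies_igm(joint_q, agent_qs):
    """Check IGM for a single state.

    joint_q: dict mapping joint-action tuples to joint Q-values.
    agent_qs: list of per-agent lists of individual utilities.
    """
    # Each agent picks its own greedy action independently.
    greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in agent_qs)
    # Compare values rather than actions so ties in the joint maximum are accepted.
    return joint_q[greedy] == max(joint_q.values())


# Additive joint value: the sum of individual utilities satisfies IGM.
utilities = [[1.0, 3.0], [2.0, 0.0]]
additive_joint = {
    (a, b): utilities[0][a] + utilities[1][b] for a in range(2) for b in range(2)
}
additive_ok = satisfies_igm(additive_joint, utilities)

# Non-monotonic game: these individual utilities point at the wrong joint action.
hard_joint = {(0, 0): 8.0, (0, 1): -12.0, (1, 0): -12.0, (1, 1): 0.0}
hard_ok = satisfies_igm(hard_joint, [[0.0, 1.0], [0.0, 1.0]])
```

The second game shows the kind of payoff structure where a restricted factorization can select a suboptimal joint action, which is the representation limitation the abstract refers to.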
InterQ: A DQN Framework for Optimal Intermittent Control
Aggarwal, Shubham, Maity, Dipankar, Başar, Tamer
In this letter, we explore the communication-control co-design of discrete-time stochastic linear systems through reinforcement learning. Specifically, we examine a closed-loop system involving two sequential decision-makers: a scheduler and a controller. The scheduler continuously monitors the system's state but transmits it to the controller intermittently to balance the communication cost and control performance. The controller, in turn, determines the control input based on the intermittently received information. Given the partially nested information structure, we show that the optimal control policy follows a certainty-equivalence form. Subsequently, we analyze the qualitative behavior of the scheduling policy. To develop the optimal scheduling policy, we propose InterQ, a deep reinforcement learning algorithm which uses a deep neural network to approximate the Q-function. Through extensive numerical evaluations, we analyze the scheduling landscape and further compare our approach against two baseline strategies: (a) a multi-period periodic scheduling policy, and (b) an event-triggered policy. The results demonstrate that our proposed method outperforms both baselines. The open source implementation can be found at https://github.com/AC-sh/InterQ.
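To make the scheduler/controller loop described above concrete, here is a toy simulation of the event-triggered baseline idea on a scalar linear system. It is a sketch under stated assumptions, not the InterQ algorithm or the code in the linked repository: the dynamics x_{k+1} = a x_k + b u_k + w_k, the feedback gain, the cost weights, and the function name `simulate` are all hypothetical. The controller applies a certainty-equivalent law u = -gain * x_hat, where x_hat is the last transmitted state propagated through the noise-free model.

```python
import math
import random


def simulate(threshold, steps=200, a=1.1, b=1.0, gain=0.6,
             comm_cost=0.5, noise_std=0.1, seed=0):
    """Return (average cost per step, number of transmissions)."""
    rng = random.Random(seed)
    x = 1.0      # true state, observed by the scheduler at every step
    x_hat = 0.0  # controller's estimate before any transmission
    total_cost, transmissions = 0.0, 0
    for _ in range(steps):
        # Event-triggered rule: transmit only when the estimate drifts too far.
        if abs(x - x_hat) > threshold:
            x_hat = x
            transmissions += 1
            total_cost += comm_cost
        u = -gain * x_hat                 # certainty-equivalent control
        total_cost += x * x + u * u       # quadratic stage cost
        x = a * x + b * u + rng.gauss(0.0, noise_std)
        x_hat = a * x_hat + b * u         # noise-free prediction of the next state
    return total_cost / steps, transmissions


always_cost, always_count = simulate(threshold=-1.0, steps=50)
never_cost, never_count = simulate(threshold=math.inf, steps=50)
sparse_cost, sparse_count = simulate(threshold=0.3, steps=50)
```

Sweeping `threshold` trades communication cost against control performance; a learned scheduling policy would replace the fixed threshold rule with a state-dependent decision.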
Deterministic Trajectory Optimization through Probabilistic Optimal Control
Filabadi, Mohammad Mahmoudi, Lefebvre, Tom, Crevecoeur, Guillaume
This article proposes two new algorithms tailored to discrete-time deterministic finite-horizon nonlinear optimal control problems, or so-called trajectory optimization problems. Both algorithms are inspired by a novel theoretical paradigm known as probabilistic optimal control, which reformulates optimal control as an equivalent probabilistic inference problem. This perspective allows us to address the problem using the Expectation-Maximization algorithm. We show that the application of this algorithm results in a fixed-point iteration of probabilistic policies that converge to the deterministic optimal policy. Two strategies for policy evaluation are discussed, using state-of-the-art uncertainty quantification methods and resulting in two distinct algorithms. The algorithms are structurally most closely related to the differential dynamic programming algorithm and related methods that use sigma-point methods to avoid direct gradient evaluations. The main advantage of our work is an improved balance between exploration and exploitation over the iterations, leading to improved numerical stability and accelerated convergence. These properties are demonstrated on different nonlinear systems.
A Distributed Approach to Autonomous Intersection Management via Multi-Agent Reinforcement Learning
Cederle, Matteo, Fabris, Marco, Susto, Gian Antonio
Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and also the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.